Adding 16bpp Pixels
I wrote this article as a response to Rawhed's article in Hugi #16 on the topic. I want to present a better way of adding 16bpp pixels. First I will show two routines, one that does simple adding with saturation and the other which creates an average of pixels. Then I will draw a description of these routines.
Routines
The two routines start in the same way, but end up differently. By right of each instruction I marked a clock cycle in which it is issued. '-' means that an instruction is issued, due to pairing, in the same cycle as the previous one. The cycles correspond to Pentium processor.
The routines process two pairs of pixels in parallel, as described later. Registers esi and edi both contain pointers to pixels. A resulting pair of pixels is put at [edi]. Registers eax, ecx, edx and ebx get destroyed.
; Load pixels and decompose them
mov eax, [esi] ; 1
mov ecx, [edi] ; -
mov edx, eax ; 2, cache hit assumed
mov ebx, ecx ; -
and eax, 07E0F81Fh ; 3
and ecx, 07E0F81Fh ; -
and edx, 0F81F07E0h ; 4
and ebx, 0F81F07E0h ; -
; Sum up the pixels ; Calculate average
add eax, ecx ; 5 add eax, ecx ; 5
add edx, ebx ; - rcr eax, 1 ; 6
add edx, ebx ; -
; Check carries and saturate edx rcr edx, 1 ; 7
sbb ecx, ecx ; 6
xor ebx, ebx ; - ; Compose and store pixels
and ecx, 0F8000000h ; 7 or eax, edx ; 8
test edx, 00200000h ; - mov [edi], eax ; 9
setz bl ; 8
or edx, ecx ; 9
xor ecx, ecx ; -
dec ebx ; 10
test edx, 00000800h ; -
setz cl ; 11
and ebx, 001F0000h ; 12
dec ecx ; -
or edx, ebx ; 13
and ecx, 000007E0h ; -
or edx, ebx ; 14
; Check carries and saturate eax
xor ecx, ecx ; -
and edx, 0F81F07E0h ; 15
test eax, 08000000h ; -
setz cl ; 16
xor ebx, ebx ; 17
dec ecx ; -
and ecx, 07E00000h ; 18
test eax, 00010000h ; -
setz bl ; 19
or eax, ecx ; -
xor ecx, ecx ; 20
dec ebx ; -
and ebx, 0000F800h ; 21
test eax, 00000020h ; -
setz cl ; 22
or eax, ebx ; 23
dec ecx ; -
and ecx, 0000001Fh ; 24
and eax, 07E0F81Fh ; -
or eax, ecx ; 25
; Compose and store pixels
or eax, edx ; 26
mov [edi], eax ; 27
In the end, the procedure that adds pixels with saturation (left) takes 27 clocks, and as it processes pixels in parallel, it takes 13.5 cycles per pixel. The other procedure that calculates average of two pixels (right) takes 9 clocks, with 4.5 clock per pixel.
How does it work
Generally, pixels are formed in the following way: 5 bits for red color component, 6 for green and 5 for blue. An example layout:
15 0
0010110011000110
Àred-Àgren-Àblu-
As a pixel takes exactly 16 bits, two pixels fit in a 32-bit register. So if we have two pixels from image A in register eax, and two pixels from image B (of corresponding position) in register ecx, we can simply do 'add eax, ecx' to get the two sums of two pixel pairs. The only problem are carries, which are added to adjacent components. So we just smartly split the components of the pixels, do the addition, then either saturate or shift it right, and compose the pixels back. See the example:
31 0
eax from image A: 00101100110001100110001111011111
ecx from image B: 10010011011010000001000001011111
Ãred-Àgren-Àblu´Ãred-Àgren-Àblu´
À----pixel-2----À----pixel-1----
Code that splits the pixels:
; copy the pixels
mov edx, eax
mov ebx, ecx
; mask off even components
and eax, 00000111111000001111100000011111b ; 07E0F81Fh
and ecx, 07E0F81Fh
; mask off odd components
and edx, 11111000000111110000011111100000b ; 0F81F07E0h
and ebx, 0F81F07E0h
And contents of registers resulting from the splitting:
31 0
eax: -----100110-----01100------11111
edx: 00101------00110-----011110-----
ecx: -----011011-----00010------11111
ebx: 10010------01000-----000010-----
Ãred-Àgren-Àblu´Ãred-Àgren-Àblu´
À----pixel-2----À----pixel-1----
'-' indicates 0 put by the splitting routine with 'and'.
The rest is simple. Maybe the carry checking in the saturation code is the only thing that requires explanation. Watch this:
xor ebx, ebx ; ebx = 00000000h
test eax, 00200000h ; zf=1 if bit21 is clear, otherwise zf=0
setz bl ; ebx=1 if bit21=0, otherwise ebx=0
dec ebx ; ebx=0 if bit21=0, otherwise ebx=0FFFFFFFFh
and ebx, 001F0000h ; mask off pix2.red (bit21 was the carry)
or eax, ebx ; set pix2.red to max value if overflowed
This way of carry checking has the advantage of taking 5 clocks with no branch. Intermixed with other code theoretically takes 3.5 clocks. A branch version takes 2 (no carry, predicted) to 10 (carry, mispredicted) with an average of 6 clocks on Pentium and 1 to 18(!) with an average of 9 clocks on Pentium II, thus this no-branch version is better.
I do not have any idea whether the given routines are the best. I only know that on Pentium the adding with saturation takes 13.5 clocks per pixel and the pixel average takes 4.5 clocks per pixel.